On the Compression of Recurrent Neural Networks with an Application to LVCSR acoustic modeling for Embedded Speech Recognition

Author: Alsharif Ouais
Bruguier Antoine
McGraw Ian
Prabhavalkar Rohit
Publication date: 02/05/2016

We study the problem of compressing recurrent neural networks (RNNs). In particular, we focus on the compression of RNN acoustic models, which are motivated by the goal of building compact and accurate speech recognition systems which can be run efficiently on mobile devices. In this work, we present a technique for general recurrent model compression that jointly compresses both recurrent and non-recurrent inter-layer weight matrices. We find that the proposed technique allows us to reduce the size of our Long Short-Term Memory (LSTM) acoustic model to a third of its original size with negligible loss in accuracy. Comment: Accepted in ICASSP 201

arXiv.org e-Print Archive

Crossref

End-to-end text recognition with deep learning architectures

Author: Alsharif Ouais
Publication venue: McGill University

Accurate text recognition in documents was one of the milestones of machine learning and computer vision techniques. However, despite this early success, general text recognition still remains an unsolved problem. Since textual information is an artificial signal, designed to be simple to draw, it can be easily confused with other simple signals that naturally exist. Moreover, unlike in document text recognition, assumptions on the way text exists should be kept to a minimum in the general setting, creating the need for more robust detectors and recognizers. From a practical point of view, engineering an end-to-end system is an elaborate effort. It involves designing multiple modules from text detection to character recognition and integrating these models in a way that allows for scalability, modularity and high accuracy. That is why most of the previous works focused only on parts of the pipeline instead of the whole end-to-end system. Moreover, the most accurate previous works traded off accuracy with scalability, making them infeasible to use in real-world settings. This thesis attempts to address this issue by showing how such an end-to-end system can be constructed with the high-level goals of balancing simplicity, accuracy and scalability, drawing on connections to speech and handwriting recognition. Specifically, this thesis shows how the end-to-end problem can be dissected into three main sub-problems: character recognition, word recognition and text detection. Then, novel solutions to each problem are proposed, and a method for integrating the three modules together is shown. Technically, the system leverages a recent variant of convolutional neural networks that uses dropout and a max activation function. It also makes use of hybrid HMM models, which were shown to be useful in speech recognition problems. Empirically, the system's performance is measured in comparison to previous systems in terms of accuracy and scalability.
Results show the proposed system outperforms previous state-of-the-art systems on benchmark datasets on all sub-problems. It also addresses scalability issues in lexicon size that previously proposed systems suffer from.

Accurate text recognition in documents has been a cornerstone of machine learning and computer vision. However, despite these early successes, the general problem of text recognition remains unsolved. Because textual information is an artificial signal designed to be easy to draw, it can easily be confused with other signals of the same kind that already exist in the environment. Moreover, unlike text recognition in a document, assumptions about how text must appear have to remain minimal in this more general scenario. More robust detectors and recognizers must therefore be developed. From a practical point of view, building an end-to-end recognition system requires considerable effort. It is necessary not only to design multiple modules for text detection and character recognition but also to integrate them in a way that allows for scalability, modularity and accuracy. This is why previous efforts were devoted only to the constituent parts of this pipeline rather than to the complete end-to-end system. Moreover, these approaches, having neglected scalability in favor of accuracy, cannot be used in the real world. This thesis attempts to solve these problems and shows how an end-to-end system can be designed to meet the ideal of simplicity without compromising accuracy and scalability. At the same time, this thesis attempts to establish connections with speech and handwriting recognition. More precisely, this thesis shows how the end-to-end problem can be decomposed into three main sub-problems: character recognition, word recognition and text recognition. Novel solutions to each of these problems are presented independently and then combined into a single system. Technically speaking, the system exploits a recent variant of convolutional neural networks that uses the dropout technique and a max-type activation function. A hybrid HMM model that has proven useful in speech recognition is also used for our problem. From an empirical point of view, the system's performance is evaluated against previous systems on the criteria of accuracy and scalability. The results show that the proposed system is superior to other state-of-the-art systems when evaluated on all sub-problems of the datasets. Finally, the scalability problem is solved for lexicons whose size limited previous systems.

eScholarship@McGill